(55) Signal bias removal for robust telephone speech recognition.

(57) A signal bias removal (SBR) method based on the maximum likelihood estimation of the bias for 
minimizing undesirable effects in speech recognition systems is described. The technique is readily 
applicable in various architectures including discrete (vector-quantization based), semicontinuous and
continuous-density Hidden Markov Model (HMM) systems. For example, the SBR method can be 
integrated into a discrete density HMM and applied to telephone speech recognition where the 
contamination due to extraneous signal components is unknown. To enable real-time implementation, 
a sequential method for the estimation of the bias (SSBR) is disclosed. 
Technical Field

contaminate the speech signal in a telephone channel. 
Background of the Invention

A speech signa. transmitted through a telephone channe, ^S^^lS^X^^ 
which significantlydeteriorate the performance of ^^^^S riSmnd interference, as well as dif- 

real-world applications. spectrum of a real noise signal, such 

An effect of a typical te.ephone channel » that * ^.^^^""STf BtBrl J action is not made con- 
3200 Hz, with variable attenuations across g^*^^ 0 ^gq U 'J^ces on the performance may result. 

Changes in articu.ation usually occur due to environmental concern in telephone speech 

mav occur merely when speaking to a mach.ne. Art.culat.or , effects are am j, ^ ^ ^ 

recognition, especially in situations where, tor exam^, ™« 

nizerf rom a public phone-booth situated near a "^J" ^ for fobust speecn ^cognition have centered 
Prior art efforts to minimize ^^T^^S^ZS^ estimate of the noise. Typical ex- 
upon three major areas. First, process.ng the ^hsgnal to re m robu8 t featu re analysis. Sec- 

amp.es include spectra, subtraction. Third, applying a robust 

Summary of the Invention

performing feature analysis on a framing s peech s ^^^Sa^, f rom tne spe ech signal to arrive at a 
Ling a likelihood function. Next. the estimate from the signal are 

tentative speech value. Comput.ng the ^^^*i^ a ^ th . pwl ^.t»^8p^valuetooom- 
repeatedapredetermined number of times, and ^ Next , a codebook of centroids is 

pi the next bfcs estimate to arrive *«" «- tented speech signal 

generated, and then the est.mate of the b.as bias speech signal value and optimal cen- 
again until an optimal set of centro.ds are generated. The "J** 1 ^ hase tnen consists 0 f utilizing 
troids are then used as the training input to the speech J^SJ^ utterance based on max- 

tha codebook generated during train.ng to «mp* « * ™ m tne speech signal to obtain a tentative 
imiz ing a likelihood function. ^^^^^J^Si, times to result In a reduced bias speech 
speech value, and then repeat.ng these t«»atopaa prese recognizer, 
value. The reduced bias speech va.ue » than used as the ^ jed after . speech recognition 

Afurther aspect of the present invent™ - thrtthaSB Rme tto. m ay PP oniy 
system has undergone any suitable training phase. Thus. SBR can 

the s 4i:r ^^^^ SBR method (SSBR) "** permrts rea, ' time 
implementation of the invention on current platform recognizers without any major structural change. The
SSBR method enables the processing of the bias at the frame level, rather than at the utterance level, without 
imposing a look-ahead frame delay. 

The SBR and SSBR methods are readily applicable to various HMM architectures, such as discrete (Vector
Quantization-based), semi-continuous and continuous density HMM systems. In addition, for both the SBR
and the SSBR methods, bias removal is carried out as an independent process following feature analysis and 
preceding recognition. Thus, the present invention may be integrated into a discrete density HMM system used 
for telephone speech recognition. 

Brief Description of the Drawings

Fig. 1 is a block diagram of a distorted telephone network;

Fig. 2A is a block diagram illustrating the integration of the signal bias removal technique of the present
invention in a speech recognition system;

Fig. 2B is a block diagram illustrating the integration of the signal bias removal technique of the present
invention in an HMM-based recognition system;

Fig. 3 is a flowchart illustrating the use of the invention when training a speech recognition system;

Fig. 4 is a flowchart illustrating the use of the invention when testing an input speech signal in a speech
recognition system;

Fig. 5A is a block diagram illustrating an apparatus incorporating the present invention;

Fig. 5B is a block diagram illustrating another apparatus incorporating the present invention;

Fig. 6 is a table illustrating the percentage word error and rates of insertion, deletion and substitution for
a baseline system with either SBR or CMS;

Fig. 7 is a plot of the norm of the cepstral bias averaged over the training and the test data at every iteration
of the SBR method;

Fig. 8 is a plot of the quantization error averaged over the training and the test data at every iteration of
the SBR method;

Fig. 9 is a histogram for the second bias coefficient when testing on a first database (DB1);

Fig. 10 is a histogram for the second bias coefficient when testing on a second database (DB2);

Fig. 11 is a histogram of the second bias coefficient when testing on DB2 following ten iterations of the
SBR method;

Fig. 12 is a table illustrating the percentage word error for the baseline system with SBR, SSBR, or CMS;

Fig. 13 is a plot of the word error rate as a function of the codebook size when using SBR;

Fig. 14 is a plot of the word error rate as a function of the codebook size when using SSBR; and

Fig. 15 is a table illustrating the percentage word error for different string lengths before and after utilizing
SBR.

Detailed Description 

Fig. 1 is a schematic block diagram of a distorted telephone network 1. A telephone speech signal X(ω)
encounters a distortion effect having a multiplicative component, H(ω), due to distortion in the telephone
channel 2, and an additive component, N(ω), representative of the ambient noise. If X(ω) is the power spectrum
of the original speech signal, then the received contaminated signal, Y(ω), is modeled as:

$$ Y(\omega) = H(\omega)\,X(\omega) + N(\omega), $$

where H(ω) and N(ω) are "biases" which are assumed to be relatively constant throughout each utterance
sequence. The present invention estimates these biases and minimizes, or removes, their effects from the
contaminated signal Y(ω).
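The contamination model can be illustrated with a minimal numerical sketch. The array shapes, the random test data and the function name `contaminate` are assumptions made for illustration only and do not come from the specification. The sketch shows that, when the additive term is zero, the channel term appears as a constant additive offset in the log-spectral domain, which is the sense in which it acts as a "bias".

```python
import numpy as np

def contaminate(X, H, N):
    """Apply Y = H * X + N per frequency bin; X has shape (frames, bins)."""
    return H * X + N

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 2.0, size=(50, 8))   # hypothetical clean power spectra
H = rng.uniform(0.5, 1.5, size=8)         # channel response, constant over the utterance
N = np.zeros(8)                           # additive noise switched off for this example
Y = contaminate(X, H, N)

# With N = 0 the log-spectral offset is the same vector, log(H), in every frame.
log_bias = np.log(Y) - np.log(X)
```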

Fig. 2A is a schematic block diagram of a speech recognition system 20 incorporating the present invention.
The distorted signal Y(ω) from a telephone channel is input to a feature analysis block 22, which performs a
sequence of feature measurements to form a "test pattern". The feature measurements are typically the output
of any of several known spectral analysis techniques, such as filter bank analysis or a linear predictive coding
(LPC) analysis. Typically, a bank of bandpass filters is used to screen the speech signal, and then a
microprocessor is used to process the filtered signals. The results of the feature analysis are a series of
vectors that are characteristic of the time-varying spectral characteristics of the speech signal. A codebook of
these distinct analysis vectors is usually generated by one or more microprocessors utilizing a vector
quantization (VQ) technique. Vector quantization is known in the art, and is sometimes used as a preprocessor
step to perform preliminary recognition decisions in order to reduce the computational load of a recognizer 24.
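As a rough illustration of how a VQ codebook of analysis vectors might be built, the following sketch uses a plain Lloyd (k-means style) iteration with Euclidean distance. The function names, the random initialization and the iteration count are assumptions for illustration; the specification does not prescribe them.

```python
import numpy as np

def nearest_codeword(frames, codebook):
    """Return, for each frame, the index of the closest centroid (Euclidean)."""
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def train_codebook(frames, size, iters=20, seed=0):
    """Alternate nearest-neighbor assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Hypothetical initialization: pick `size` distinct training frames.
    codebook = frames[rng.choice(len(frames), size, replace=False)].copy()
    for _ in range(iters):
        idx = nearest_codeword(frames, codebook)
        for k in range(size):
            members = frames[idx == k]
            if len(members):              # an empty cell keeps its old centroid
                codebook[k] = members.mean(axis=0)
    return codebook
```

Each frame is then represented by the index of its nearest centroid, which is the preliminary decision the recognizer can reuse.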

Referring again to Fig. 2A, the output signal of the feature analysis block 22, V(fi>), is used as the input to 
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asignalbfcsremova.tSBR^^ 

27. The SBR block 26 first computes an esuma e of the bias ^the speech s,gn charactenzes 
timated bias to generate a signal Y"(a>) for mput to a speech » g jmates 

the spectra, properties of the frames of the ^f^.^ SS^^>^ 
the original speech signal X(«» with the b,as °^^^^^on h the Hidden Markov Mode, 
which can be used byone or more microprocessors ^^^^*SLo 8 nlz W 24^ha«peBch 

nr^rteX^ 

As shown, the bias removal is carried out as an ' n ^ n *2K?SB^S^ to become an integrated 
in. pocws of <k*ng "II" b— . «•) «* "S£^^J^Jl»^k and cw.ua sn.l»*>. C.p- 
as the unknown parameter. The likelihood function is defined as:

$$ p(X \mid \Lambda) = \prod_{t=1}^{T} p(x_t \mid \Lambda) $$

where X = {x_1, x_2, ..., x_T} is an observation sequence of T frames. The speech model is
Λ = {λ_i, i = 1, 2, ..., M}, where λ_i is the Markov model for a speech unit i. For a distorted observation
sequence

$$ Y = \{y_1, y_2, \ldots, y_t, \ldots, y_T\} $$

and a bias vector b,

$$ p(Y \mid b) = p(Y - b). $$

The likelihood function thus becomes:

$$ p(Y \mid b, \Lambda) = \prod_{t=1}^{T} p(y_t - b \mid \Lambda) $$

and the maximum likelihood bias estimator, b̂, is the one that achieves:

$$ p(Y \mid \hat{b}, \Lambda) = \max_{b}\, p(Y \mid b, \Lambda). $$

The solution for the maximum likelihood bias estimate, b̂, can be found by using an iterative procedure.
Consider a Gaussian local observation:

$$ p(y_t \mid b, \lambda_i) = k_i \exp\{-\tfrac{1}{2}\,(y_t - b - \mu_i)'(y_t - b - \mu_i)\}, $$

where μ_i is the mean vector and k_i is the normalizing constant, which does not depend on the bias. For a given
bias vector b, each adjusted observation is:

$$ \hat{x}_t = y_t - b $$

and its nearest neighbor can be solved for, such that:

$$ z_t = \mu_{i^*}, \qquad i^* = \arg\max_{i}\, p(y_t \mid b, \lambda_i). $$

The likelihood function thus becomes:

$$ p(Y \mid b, \Lambda) = K \exp\Big\{-\tfrac{1}{2}\sum_{t=1}^{T} (y_t - b - z_t)'(y_t - b - z_t)\Big\} $$

where K is a constant that does not depend on the bias, b. By maximizing the quadratic function for p(Y | b, Λ)
with respect to the bias vector b, a unique solution for the bias estimate is guaranteed:

$$ \hat{b} = \frac{1}{T}\sum_{t=1}^{T} (y_t - z_t) $$
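The closed form can be checked with a short worked step. This is a sketch under the unit-covariance Gaussian assumption stated for the local observation, with the codewords z_t held fixed: maximizing the likelihood is the same as minimizing the summed squared distance in the exponent, so the gradient with respect to b is set to zero.

```latex
\begin{aligned}
\frac{\partial}{\partial b}\sum_{t=1}^{T}(y_t - b - z_t)'(y_t - b - z_t)
  &= -2\sum_{t=1}^{T}(y_t - b - z_t) = 0 \\
\Longrightarrow\quad T\,\hat{b} &= \sum_{t=1}^{T}(y_t - z_t)
\quad\Longrightarrow\quad
\hat{b} = \frac{1}{T}\sum_{t=1}^{T}(y_t - z_t).
\end{aligned}
```

Because the exponent is a convex quadratic in b, this stationary point is the unique maximizer of the likelihood for the given codewords.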



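An updated nearest neighbor search is then conducted:

$$ \hat{z}_t = \mu_{\hat{\imath}}, \qquad \hat{\imath} = \arg\max_{i}\, p(y_t \mid \hat{b}, \lambda_i), $$

and it is ensured that:

$$
\begin{aligned}
p(Y \mid b, \Lambda)
  &\le K \exp\Big\{-\tfrac{1}{2}\sum_{t=1}^{T} (y_t - \hat{b} - z_t)'(y_t - \hat{b} - z_t)\Big\} \\
  &\le K \exp\Big\{-\tfrac{1}{2}\sum_{t=1}^{T} (y_t - \hat{b} - \hat{z}_t)'(y_t - \hat{b} - \hat{z}_t)\Big\} \\
  &= p(Y \mid \hat{b}, \Lambda).
\end{aligned}
$$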

Therefore, by iteratively and interleavingly finding the best codeword, z_t, and obtaining the best "tentative"
bias estimate, b̂, the likelihood function of the bias vector b is increased until a locally optimal, or fixed
point, solution for b is reached. Note that the original, distorted process Y is split into two processes X̂ and
B̂ = Y - X̂. If X̂ is a reasonable estimate of the undistorted signal X, and B is assumed to be stationary, it is
then reasonable to assume B̂ to be stationary and the maximum likelihood bias estimate b̂ to be a good estimate
of the true value of the bias b. Note that an alternative implementation is to gradually reduce the bias
in the signal at each frame, y_t, as the iteration progresses; that is, at the n-th iteration:
where x̂_t is defined with:

$$ \hat{x}_t = y_t - \hat{b}^{(n-1)} - \hat{b}^{(n-2)} - \cdots - \hat{b}^{(0)}. $$

ThIsl eads to the same results o«^ 

algorithm to compute a locally optima, se t of coda ve*o* o ^g^SZL*. . blas term is included 
error. In the maximum likelihood formulat.cn ° ^ trad.Uonal vector q ? ormulation is use d except 

which is assumed to be constantly zero. *> be estimated by the maximum 

that the cerrtroids are held constant wh.le the b ^' e ^^l^ om ^ by tne generalized Lloyd algo- 



that 
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— 1 K"^ i - r* \ 

1 C-L 
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where the best codeword z, is the "nearest neighbor" to the distorted signa. spectrum y, 

*c = y, - b, 13*§T, 

whirh is a maximization of the likelihood function p(x I A) 
resulting in a tentative training -l^**** in .^JUJTJE-* number N has been reached. If not. 
described above. Next, the mdex n ,s , ctoctod I m rtapOT to se p ^ ^ 

„ is incremented by 1 in step 38. and the b as estimate .s rec P 3g ^ gecond jndex m „ 

45 speech value. This process is repeated unt.h - N. not in step 40 the first index n is reset to 
checked to see if a predetermined ""^i^"' Jecto qua ntization is performed in step 41 

so ofthebia,b,scomputod^ 

procedure is Aerated until M is ^ iSSK fS^^Si • 

Suction in the bias (also the quant.zat.on " ™2 are used for training the HMM recog- 

the present invention. . A t ,nna the testing phase of an HMM recognition system, 
method explained above with respect to Fig. 3. However, other training techniques can be used as long as a
codebook of centroid vectors is generated.
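The training procedure of Fig. 3, as restated in steps (a) to (g) of claim 1, can be sketched as two nested loops: an inner loop of N bias-removal passes per utterance and an outer loop of M codebook regenerations. The function names, the Euclidean nearest-neighbor search and the default values of M and N are illustrative assumptions. The sketch restarts each outer pass from the original utterances, which is the variant the experimental section reports.

```python
import numpy as np

def quantize(frames, codebook):
    """Index of the nearest centroid for every frame (Euclidean distance)."""
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def lloyd(frames, codebook, iters=10):
    """Refine centroids by alternating assignment and mean update."""
    codebook = codebook.copy()
    for _ in range(iters):
        idx = quantize(frames, codebook)
        for k in range(len(codebook)):
            if np.any(idx == k):
                codebook[k] = frames[idx == k].mean(axis=0)
    return codebook

def remove_bias(Y, codebook, n_iter):
    """Inner loop: estimate the utterance bias and subtract it, n_iter times."""
    X_hat = Y.copy()
    for _ in range(n_iter):
        Z = codebook[quantize(X_hat, codebook)]
        X_hat = X_hat - (X_hat - Z).mean(axis=0)
    return X_hat

def train_with_sbr(utterances, init_codebook, M=2, N=10):
    """Outer loop: regenerate the codebook M times from reduced-bias data."""
    codebook = lloyd(np.vstack(utterances), init_codebook)            # step (a)
    cleaned = utterances
    for _ in range(M):
        cleaned = [remove_bias(Y, codebook, N) for Y in utterances]   # steps (b)-(d)
        codebook = lloyd(np.vstack(cleaned), codebook)                # step (e)
    return codebook, cleaned                                          # input to step (g)
```

The returned codebook and reduced-bias utterances would then serve as the training input to the recognizer.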

Referring to Fig. 4, a contaminated speech signal 51, Y(ω), is input to the system, and feature analysis of
the signal is performed at step 52 to generate cepstral coefficients. An index n is set equal to one at step 53,
and then an estimate of the bias b is computed as described above, in step 54, using the codebook 55 that
was generated during training. The bias estimate b is then subtracted from the speech signal in step 56 to
generate a tentative speech signal x̂_t. Next, if the index n has not reached a predetermined number N in step
57, then n is incremented by one in step 58, the bias is recomputed in step 54 and removed from the speech
value to form a new tentative speech signal. This process continues until the predetermined number N is
reached, at which point the resulting reduced bias signal is fed to an HMM recognizer in step 59 for processing.
The dotted line box 46 demarcates the steps of the SBR process. Thus, the equations for computing the bias
estimate, b, and best codeword, z_t, are repeated with the new, improved set of centroids until the likelihood
function reaches a fixed point. Typically, one or two iterations are adequate to ensure convergence. Note that
the nearest neighbor search to find the best codeword, z_t, could instead involve a memory structure such as
that to be solved by the Viterbi algorithm.
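The test-phase loop of Fig. 4 (steps 53 to 58) can be sketched as follows. This is a minimal illustration assuming a Euclidean nearest-neighbor search over a fixed codebook and an utterance held as a (frames x coefficients) array; the function names and the default iteration count are hypothetical.

```python
import numpy as np

def nearest_codewords(frames, codebook):
    """Return the closest centroid z_t for every frame."""
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return codebook[d2.argmin(axis=1)]

def signal_bias_removal(Y, codebook, n_iter=2):
    """Return (reduced-bias frames, accumulated bias) for one utterance."""
    X_hat = Y.copy()
    total_bias = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        Z = nearest_codewords(X_hat, codebook)   # best codeword per frame (step 54)
        b = (X_hat - Z).mean(axis=0)             # bias estimate: mean of y_t - z_t
        X_hat = X_hat - b                        # tentative speech value (step 56)
        total_bias += b
    return X_hat, total_bias
```

The reduced-bias frames returned after the last pass are what would be handed to the recognizer in step 59.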

As discussed above, the distorted speech signal Y(ω) contains two types of biases. These biases can be
reduced by the above method in an integrated, iterative manner. After minimizing or removing the additive
spectral bias N(ω), the filtered spectral signal Ŷ(ω) may be transformed into cepstrum:

$$ \hat{y} = x + h, $$

where

$$ \hat{y} = \mathrm{IDFT}[\log\{\hat{Y}(\omega)\}], $$

$$ x = \mathrm{IDFT}[\log\{X(\omega)\}], $$

and

$$ h = \mathrm{IDFT}[\log\{H(\omega)\}]. $$

By applying the above two-step procedure for bias removal, using cepstrum rather than spectrum, a new
set of centroids is generated which minimizes or removes the additive bias component. This also ensures a
maximization of the local likelihood probability. The SBR method can then be iterated several times between
the spectral and cepstral domains until extraneous effects are minimized as much as possible.

It should be noted that there is a strong relationship between signal bias removal and cepstral mean
subtraction (CMS). In fact, CMS is equivalent to SBR when a one-codeword vector quantizer is used, where
Λ = {μ_0}. If μ_0 is a zero vector, thus assuming that the long-term cepstral average of speech is zero, then
the SBR method is reduced to CMS, with the bias vector b representing the frame cepstral average computed
over the whole utterance.
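The stated equivalence can be checked numerically with a small sketch. With a single codeword every frame maps to the same centroid, so the bias estimate is the utterance mean minus that centroid; with a zero centroid this is exactly the cepstral mean. The function names and the random test data are assumptions for illustration.

```python
import numpy as np

def sbr_single_codeword(Y, mu0):
    """One SBR pass with a one-entry codebook {mu0}: every z_t equals mu0."""
    b = (Y - mu0).mean(axis=0)
    return Y - b

def cepstral_mean_subtraction(Y):
    """Subtract the per-utterance mean of each cepstral coefficient."""
    return Y - Y.mean(axis=0)

rng = np.random.default_rng(0)
Y = rng.normal(size=(40, 12))                    # hypothetical 12-dimensional cepstra
same = np.allclose(sbr_single_codeword(Y, np.zeros(12)),
                   cepstral_mean_subtraction(Y))
```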

When incorporating SBR into a platform speech recognizer, an important consideration is the look-ahead
frame delay necessary for the estimation of the bias. The above discussion assumes that the entire utterance
is available prior to computing the bias, which is typically not the case for real-world applications. In many
practical systems, a speech utterance is commonly analyzed on a frame-by-frame basis, or frame synchronously.
A speech frame is equal to some predefined speech interval, for example, 30 millisecond sections of a speech
utterance. Thus, in real-world applications, processing is typically carried out in synchronous fashion wherein
the first frame is processed by a first microprocessor and then the processed frame is passed to one or more
other microprocessors for recognition analysis as the first microprocessor starts to work on the next frame of
speech. Thus, acoustic features are passed to the recognizer at every frame, instead of in a batch mode wherein
the entire speech utterance is analyzed all at once. This process of dealing with each frame individually is
crucial for real-time implementation and minimal memory requirements.

Fig. 5A is a block diagram 60 illustrating a speech recognition apparatus that can incorporate the present
invention. A contaminated speech signal Y(ω) is input to a first microprocessor 61, which performs feature
analysis using software routines stored in a shared memory 62, which comprises both random-access and
read-only memory. The first microprocessor 61 also implements the SBR process of the present invention, and
speech data is stored in the memory 62. The output speech signal x̂(ω) from the first microprocessor is then
input to a second microprocessor 63, which performs speech recognition to generate the text output.

Fig. 5B is a block diagram 70 illustrating another speech recognition apparatus which
may incorporate the invention. A contaminated speech signal Y(ω) is input to a first microprocessor 71, and
once again a shared memory 72, comprising both random access and read-only memory, is used for data stor- 
age. A plurality of microprocessors (73, 74 to X) process the output speech signal x (©) from the first micro- 
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15 



processor to perform speech recognition renting in th ^^^^ • l^Z^l'r 
5B is typicaHy used to process a speech utterance^ on J^lS^^the present invention can 

73. 74 to X may operate on different frames as the .utterance ^ fJ~a^T . ^ ^ 

bias vector b, there are two possibles to consider. F rst a 2^^^^!?*. second pass to perform 
vector wen asother^^^^ 

y_t - z_t, then:

$$ \hat{b}_t = \alpha\,\hat{b}_{t-1} + (1 - \alpha)\,\delta_t $$

and

$$ \hat{x}_t = y_t - \hat{b}_t $$

where α is a weighting coefficient. Note that the estimate of the deviation vector δ is computed iteratively,

20 s" chtnat: 5 = r ( „- i) ♦ B>- » + - + *< (0) . . ». 

■ < .«JLti„nrtw. bias and b> 1 > is the sequential bias estimate at iteration 
where n is the number of iterates for rees ^ tm «£ Ji at t J\ is equivalent to that computed 
n - 1 and frame t In order to ensure that the sequen ,al bases *™**«^ to performin g SSBR. 

over the whole utterance, the weighting coefficient a is set to (i i )n. uin*i_ aw 
2S S m uX a bootstrapped bias estimate or a ^J^^^ to the above equation for the 
,t shouid be realized that the same P^'^.^Z ^.average. Thus, if , denotes 
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and 

= yt -°t- 
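The sequential (frame-by-frame) estimate can be sketched as a recursive average. The sketch assumes a weighting coefficient of (t - 1)/t, which is the value that makes the recursion equal to the running mean of the per-frame deviations y_t - z_t, and it assumes that each frame is quantized after subtracting the previous bias estimate; both choices, and all names, are illustrative assumptions rather than a statement of the patented method.

```python
import numpy as np

def sequential_bias_removal(Y, codebook):
    """Update the bias at every frame; no look-ahead over the utterance."""
    T, D = Y.shape
    b = np.zeros(D)                       # bias estimate before the first frame
    X_hat = np.empty_like(Y)
    for t in range(1, T + 1):
        y = Y[t - 1]
        # Nearest codeword to the frame adjusted by the previous bias estimate.
        d2 = ((y - b - codebook) ** 2).sum(axis=1)
        z = codebook[d2.argmin()]
        alpha = (t - 1) / t               # assumed weighting coefficient
        b = alpha * b + (1 - alpha) * (y - z)
        X_hat[t - 1] = y - b              # reduced-bias frame, available immediately
    return X_hat, b
```

Early frames rely on an estimate built from very little data, which is consistent with the source's remark about a noisy estimate during the initial part of the utterance.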



method (see Fig. 2B). Dunng training of ft Hmatlnfl and removing the bias was repeated 

training utterance and subtracted from .t This p < ™JJ2n value, or length, of the bias was ob- 
twenty times, beyond which no s'^"^^^^ iterated four times to ensure 

ad ^uS^ » * " in the quantization 

erTor - . -.honinnutsianal sampled at 8 kHz. that was initially pre-emphasized 

The experiments were conducted with an input signal, sample i a , Eacn f rame was H am- 

(1-0.95^) and grouped into frames of **V^^^Sn7!Ld through a set of 30 tri- 

ming windowed. Fourier transformed into the "^^jjjj 1 advantage of the human au- 

cepstrum and delta-delta cepstrum "^^J^ and its first and second order time derivatives 

Besides the cepstral-based features the log of ^ energy an ^ ^ ^ com . 

were a.so computed. Thus, each speech frame was ^^^l^oU***™™. 

putation of a., the higher order coefficen ^^^^SSL^b cepstrum. 1 energy. 1 delta 
The input features, namely. 12 cepstrum. 12 delta ■ «»P»™ Quantization . The generalized Uoyd algo- 

energy and 1 delta-delta energy were ^^^^^^ otm perfeature vector. The codebook 

rithm was applied on the entire training data to 9 enerate SIX ^ 3, the energy-derived coefficients. Such 

sfcewasset to 256forthecepstrum-der^ 

codebook sizes have been shown to provide ^^^^^ cen troids was set to a maximum of 
^re P ger,^ 
The speech recognizer was based on a discrete density HMM using whole word models, one model per
digit (1, 2, ..., oh, zero) and per gender, male and female. Models were left-to-right with no skip state
transitions. A total of 24 models including silence and pause were used. The number of states for each model
varied between one state for silence and twenty-one states for zero. (The number of states for each digit model
was computed using a simplex search optimization method.) Ten iterations of the maximum likelihood estimation
were employed during training, followed by three iterations of the maximum mutual information. The latter
training criterion was also applied for computing the exponents or weights of the six codebooks. The examples
assume unknown length grammar and unendpointed strings both in training and testing.

Two connected digits databases were used to evaluate the robustness characteristics of the signal bias
removal method. The databases were recorded over telephone lines by having individuals read digit strings
from a predefined list.

The first database, DB1, was collected from five dialectically distinct regions within the United States,
namely, Long Island, Chicago, Boston, Columbus, and Atlanta. Each region consisted of 100 adult talkers (50
males and 50 females), each speaking 66 connected digit strings from a predefined list (11 digit strings for
each of lengths two through seven). Half of their utterances were recorded using two electret microphone
handsets, and the other half using two carbon button microphone handsets. Speech was transmitted over a
long-distance telephone network that was either all analog, all digital or a mix, depending on the region. A
subset of this database consisting of 14629 strings was assigned for training, and a different subset of 7073
strings was assigned for testing.

The second database, DB2, was collected from two dialectically distinct regions, namely, Long Island and
Boston, over a digital T-1 interface. Speech was recorded using four different microphone handsets, two
electret and two carbon button. Digit strings of lengths 10, 14, 15 and 16 digits, corresponding to credit card
numbers and long-distance telephone numbers, were collected from 250 adult talkers (125 males and 125 females).
A subset of this database of 2842 strings was utilized for testing only.

Training was performed on the training portion of the first database DB1, and testing was performed on
the testing portions of both databases DB1 and DB2. Testing on DB1 was considered as under "matched"
conditions, and testing on DB2 was considered as under "mismatched" conditions.

In order to quantify the degradation in recognition performance when training and testing under mismatched
environmental conditions, the table of Fig. 6 shows the error rate for SBR (column 4)
when using mel cepstrum, for the baseline recognition. The majority of the errors were due to an increase in
the deletion rate, although a moderate rise in the rates of substitution and insertion was observed.

Simulation results are described below that illustrate the capabilities of the SBR method, where the
computation of the bias is performed at the utterance level, and the SSBR method or sequential estimate of the
bias, when each method is integrated as part of the baseline HMM system. Note that although the formulation
of the bias removal method presented earlier is applicable to the spectral domain for noise bias removal, it
was strictly employed in the cepstral domain here.

A plot of the norm of the bias vector |b| at every iteration, averaged over all the training data of DB1, is
shown in Fig. 7. Note that every time a new codebook was generated, the original "unbiased" data was used
for recomputing the bias. This explains the sudden jump in the norm when m is incremented. Whether the original
"unbiased" data or the "biased" data from the preceding iteration, m - 1, was used, the results were the same.
Although using the latter alternative would ensure a faster convergence, it is expensive since additional
memory storage is then required for the processed data at every iteration. Further, no significant reduction in
the norm value of the bias beyond m = 2 and n = 10 [n = iteration] was observed (see Fig. 7), and it approaches
zero as the number of iterations increases.

During recognition, an estimate of the bias was computed for each test utterance and subtracted from the
speech signal. As in training, this procedure was repeated twenty times. Each utterance was then passed to the
recognizer. Fig. 7 shows the average norm of the cepstral bias for the test data of DB1. Clearly, the norm value
becomes approximately zero beyond ten or so iterations, approaching that of the training data. Fig. 8 shows
the variations in the quantization error, or the average Euclidean distance, for the training and the testing
data at every iteration of the SBR method. The plots indicate that the error drops by about 30% from its initial
value before applying SBR, and ends only 6% above that of the training data.
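The quantization error tracked in Fig. 8 is described as an average Euclidean distance, which could be computed along the following lines. The function name and the array layout are assumptions for illustration; the exact averaging used for the figure is not given in the text.

```python
import numpy as np

def quantization_error(frames, codebook):
    """Average Euclidean distance from each frame to its nearest centroid."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
frames = np.array([[3.0, 4.0], [10.0, 10.0]])
err = quantization_error(frames, codebook)   # distances 5 and 0, so the mean is 2.5
```

Evaluating this quantity on the frames before and after each bias-removal pass would give one point per iteration of the kind of curve the figure describes.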

Referring again to Fig. 6, column 4 of the table shows the word error rate when introducing SBR with mel
cepstrum. These results suggest that the SBR method is able to reduce the word error rate by as much as
41% for mismatched training and testing conditions (DB2 with mel cepstrum). In addition, the improvement
was chiefly due to a reduction in the rates of deletion and substitution by over a half, during mismatched
conditions, with the insertion and deletion rates becoming relatively equal.

A similar experiment was also conducted using CMS rather than SBR. Column 6 of Fig. 6 shows the word 
error rates for CMS when using mel cepstrum. The additional improvements that SBR provides over CMS sug- 

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gests that havingafinite number of ^ 

space, as in the case of SBR, is more reward.ng *^J£™^££^S£ over CMS, for DB1 
An advantage of the SBR method over CMS . hat SOT S can JJJjW 9 column 5 (SBR , } of tne 

probability distribution, for the ^ "-^^^X^ have a Gaussian-lite shape. However, 
tribution. as v^llasthose of other coeff.c>ents have be^ 

this was generaiiy not found to be the ^^ a ^SSS^> a with higher variance and dynamic 
same ^efficient when testing on D distributions become sharper (smaller variance) 

10 following ten iterations of the SBR method. ; utterance for the estimation of the 

7" . . „._ ^, 44->^> in nut Qlnnftl. * ..... o\ ,.,'.4>K 

D t , pnui lo u^uot„.y ... r — — ■ eeostrum for the baseline sysiem v^ion... ■ 

FigM2 presents the word error rates when us,^^^ 

eitherSBR (column 3). orSSBR(column 4), orCMS (oo u tn resu|ts for both the SBR method 

during testing, while traininginduded^ 

and the SSBR method uti.ized the t ^ tmm ^^^Z 0 uL bias, as opposed to estimating the 
SarStc^ 

lihood probability function results. ir ,* finr!a t«d as Dart of a VQ (or a semi-continuous)- 

^tmSon^ 

Fig. 1 3 shows the word error rate for the SBR : method ^as a tunc ^ ^ ^ DB2 

Introducing a smarted 

the noisy estimate of the bias during the initial part of the utterance ^a fe necessary for 

Hon. rather than extraneous when applying serial 

the SSBR method to have a posrtrve im P acl ^ e ^° 9 "' P slzed codebook should be used if a minimal 
formulation of the SBR causes a slight degradation of up to 5% in the digit error rate.

In summary, the signal bias removal (SBR) method utilizes an iterative procedure for estimating the bias
in the spectral and cepstral domains for the minimization of deleterious signal components in telephone speech
recognition. The procedure is based on maximizing the likelihood of a speech model in which the bias is
considered as the unknown parameter. The SBR method, as applied in the cepstral domain only, can be integrated
as part of a discrete density HMM system. Further, to enable real-time implementation, a sequential signal
bias removal method (SSBR) was shown to be effective when processing speech signals on a frame-by-frame
basis. Results from experiments using two speaker-independent databases, wherein the data from the speakers
consisted of spoken strings of digits, indicate that the SBR method, when applied to a fairly long string of
digits, is capable of minimizing extraneous channel distortion, and consequently improving the performance
of telephone speech recognition.

Further, the experimental results indicate that when introducing SBR during testing only, as opposed to
during both training and testing, the word error rate rises only up to 14%. For CMS, this would result in a jump
in the error rate by a factor exceeding three. This advantage of being able to apply SBR without retraining
the recognition models is desirable in all existing applications of speech recognition.

It is to be understood that the above-described embodiments are merely illustrative, and that many vari- 
ations can be devised by those skilled in the art without departing from the scope of the invention. 



Claims

1. A method for minimizing the effect of an unknown signal bias in an input speech signal for use by a speech
recognition system, comprising:

(1) training the speech recognition system by using the following steps:

(a) generating a set of centroids based on a training speech signal;

(b) computing an estimate of the bias for the training speech signal based on maximizing a likelihood
function;

(c) subtracting the estimate of the bias from the training speech signal to obtain a tentative training
speech value;

(d) repeating steps (b) and (c) a preset number of times, wherein each subsequent computed estimate
of the bias is based on the previous tentative training speech value to arrive at a reduced bias
training speech signal value;

(e) recomputing the centroids based on the reduced bias training speech signal to generate a new
set of centroids;

(f) repeating steps (b) to (e) a predetermined number of times to compute a processed reduced bias
speech signal and to form an optimal set of centroids;

(g) utilizing the optimal set of centroids and the processed reduced bias speech signal as training
input for a speech recognizer;

(2) testing an input speech signal to minimize the unknown bias by using the following steps:

(h) utilizing the optimal set of centroids to compute an estimate of the bias for each utterance of
the speech signal based on maximizing a likelihood function;

(i) subtracting the estimate of the bias from the speech signal to obtain a tentative speech value;

(j) repeating steps (h) and (i) a preset number of times, wherein each subsequent computed estimate
of the bias is based on the previous tentative speech value, resulting in a reduced bias speech
signal value; and

(3) utilizing the reduced bias speech signal as input to a speech recognizer.

2. The method of claim 1, wherein the speech recognition system utilizes a Hidden Markov Model speech
recognizer.

3. A method for minimizing the effect of an unknown signal bias on an input speech signal during the testing
phase of a speech recognition system, comprising:

(a) computing an estimate of the bias for each utterance of the speech signal based on maximizing a
likelihood function by initially utilizing a set of centroids generated by a training model;

(b) subtracting the estimate of the bias from the input speech signal to obtain a tentative speech value;

(c) repeating steps (a) and (b) a predetermined number of times, wherein each subsequent computed
estimate of the bias is based on the previous tentative speech value, resulting in a reduced bias speech
signal value; and
(d) utilizing the reduced bias speech signal value as input to a speech recognizer.
(a)- 

A^ to ^^"-- M ---* W "* - * ,- "* i ' ,, * ,,N 

speech value; . . mirY , hor nf times wherein each subsequent com- 

(2) ISng an SSlSSS - minimize the unKnown bias by using the Mo** steps: 

( S Sng a weighting coefficient for updating a b.as value. 

(i) analyzing an utterance on a frame-by-f rame **s\s^ Qn rnaximizing 

(k) computing a sequential bias estimate .o. -.— - 

^rr^rreouentia. bias estimate from the input speech signa, at every frame to obtain 
a tentative speech value; niimber of times, wherein each subsequent com- 

recognizer. 

fhTteSng Phase of a speech recognition system. compr.s.ng: 
^J3r! weighting coefficient ^ updating a b^ vaiue. 

jdTrrngte^s 

tentative speech value; Wm : nRf4 number of times, wherein each subsequent computed 

T,.^ — — T.— .-— — — - — — — — — 

(b). 
a shared memory means connected to the first microprocessor for storing speech data; and
a second microprocessor means connected to the memory and connected to the output of the first
microprocessor means, for performing speech recognition based on the generated reduced bias signal.

10. The apparatus of claim 9, further comprising:

a plurality of microprocessor means connected to the output of the first microprocessor and connected
to the memory, for collectively performing speech recognition based on the generated reduced
bias signal.

EP 0 674 306 A2 



FIG. 1 [block diagram: speech in, X(ω)]

FIG. 2A [block diagram; reference numerals 20, 22, 25 (codebook), 26]

FIG. 2B [block diagram: speech Y(ω), feature analysis, SBR, HMM recognizer; reference numerals 21, 23, 26, 29]

FIG. 3

FIG. 5A

FIG. 6

FIG. 12